Add poison message handling to the dispatchers#1366
Conversation
…ase of poison message handling, except for entity unlock requests
Co-authored-by: Chris Gillum <cgillum@microsoft.com>
Co-authored-by: Chris Gillum <cgillum@gmail.com>
…ad for trace activities
…vent for json deserialization, etc.
| this.Reason = reason; | ||
| } | ||
|
|
||
| // Private ctor for JSON deserialization (required by some storage providers and out-of-proc executors) |
There was a problem hiding this comment.
Unrelated to this PR but I bug I found when testing (JSON was not able to deserialize this event because it lacked a 0-arg constructor and the other constructors all had multiple parameters)
There was a problem hiding this comment.
Also unrelated to this PR, but I realized while working on it that this code I wrote a while back had some incorrect assumptions so I took the opportunity to fix it
There was a problem hiding this comment.
Pull request overview
This PR adds an extensibility hook (IPoisonMessageHandler) and integrates poison/invalid message detection into the core dispatchers so that corrupted or “poisoned” inputs can be handled deterministically (e.g., fail orchestration/activity/entity work) instead of always throwing.
Changes:
- Introduces
IPoisonMessageHandlerand wires it into orchestration/activity/entity dispatchers for invalid work items and poison message detection. - Adds structured logging support for poison-message detection (new event ID + event source + log event).
- Adds dispatch-count tracking on history events and propagates poison metadata through entity request processing.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/DurableTask.Core/Tracing/TraceHelper.cs | Adjusts entity invocation activity ending to better handle partial result sets. |
| src/DurableTask.Core/TaskOrchestrationDispatcher.cs | Adds poison detection/handling and updates reconciliation to return a drop reason. |
| src/DurableTask.Core/TaskEntityDispatcher.cs | Adds poison detection/handling for entity messages, plus poison-aware batching/result shaping. |
| src/DurableTask.Core/TaskActivityDispatcher.cs | Adds poison/invalid handling for activity scheduling messages (including failing poisoned tasks). |
| src/DurableTask.Core/Logging/StructuredEventSource.cs | Adds a new structured event for poison message detection. |
| src/DurableTask.Core/Logging/LogHelper.cs | Adds PoisonMessageDetected helper overloads emitting structured logs. |
| src/DurableTask.Core/Logging/LogEvents.cs | Adds a new structured log event type for poison messages. |
| src/DurableTask.Core/Logging/EventIds.cs | Reserves a new event ID for poison message detection. |
| src/DurableTask.Core/IPoisonMessageHandler.cs | New interface defining poison detection and handling hooks. |
| src/DurableTask.Core/History/HistoryEvent.cs | Adds DispatchCount to history events for poisoning heuristics/telemetry. |
| src/DurableTask.Core/History/ExecutionRewoundEvent.cs | Adds a parameterless ctor for JSON deserialization compatibility. |
| src/DurableTask.Core/Entities/OrchestrationEntityContext.cs | Adds AbandonAcquire() to reset lock acquisition state on failure. |
| src/DurableTask.Core/Entities/EventFormat/RequestMessage.cs | Adds poison metadata fields used during entity request processing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… combined' Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
… combined' Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
cgillum
left a comment
There was a problem hiding this comment.
Some initial comments. I haven't gone through the dispatcher code yet (those are bigger diffs).
| /// If the request message is poisoned, the reason it is poisoned. | ||
| /// Otherwise, null. | ||
| /// </summary> | ||
| public string? PoisonReason { get; set; } |
There was a problem hiding this comment.
What's the use case for PoisonReason in entity request messages?
There was a problem hiding this comment.
It's used when generating the failure response for a poisoned entity operation, i.e. here.
Since processing the history event that corresponds to the entity request is decoupled from sending the response, we need to store the poison reason in the request message so we can use it later to populate the failure details.
| /// <param name="historyEvent">The history event being dispatched.</param> | ||
| /// <param name="reason">Why the message is considered poisoned.</param> | ||
| /// <returns><c>true</c> if the message should be treated as poisoned; otherwise <c>false</c>.</returns> | ||
| public bool IsPoisonMessage(HistoryEvent historyEvent, out string? reason); |
There was a problem hiding this comment.
Why do we need the backend to decide whether something is a poison message? Shouldn't the dispatcher be making this decision?
There was a problem hiding this comment.
We don't, and it definitely can. I thought it would be the responsibility of the "poison message handler" since it's sort of in the name but this was just a somewhat arbitrary choice on my end. I can return responsibility to the dispatchers
There was a problem hiding this comment.
One thing that's tricky about reviewing this interface is that we don't yet have any implementations of it. I'm thinking we should probably implement this for Azure Storage before committing to these new public APIs.
There was a problem hiding this comment.
Actually one thing I remembered is that doing it this way meant I didn't have to pass a maxDispatchCount when instantiating the dispatchers, which meant we didn't need yet another overload for the TaskHubWorker to accept this parameter when it creates them. But if we want to keep this decision on the dispatcher side and avoid another parameter, we can maybe expose this via IPoisonMessageHandler.MaxDispatchCount, if that sounds reasonable?
cgillum
left a comment
There was a problem hiding this comment.
Finished reviewing. Just a few more comments (and some responses).
| /// <param name="historyEvent">The history event being dispatched.</param> | ||
| /// <param name="reason">Why the message is considered poisoned.</param> | ||
| /// <returns><c>true</c> if the message should be treated as poisoned; otherwise <c>false</c>.</returns> | ||
| public bool IsPoisonMessage(HistoryEvent historyEvent, out string? reason); |
There was a problem hiding this comment.
One thing that's tricky about reviewing this interface is that we don't yet have any implementations of it. I'm thinking we should probably implement this for Azure Storage before committing to these new public APIs.
| @@ -1,4 +1,4 @@ | |||
| // ---------------------------------------------------------------------------------- | |||
| // ---------------------------------------------------------------------------------- | |||
There was a problem hiding this comment.
It might be good to have @sebastianburckhardt review the changes to this file as I'm less familiar with the details of entity dispatching.
This PR introduces poison message handling to the dispatchers. This is done by
IPoisonMessageHandlerthat the any orchestration service which has poison message handling is expected to implementThe orchestration service is otherwise responsible for determining what to do with the poison message(s) and how to store them.